Looking Beyond Text: Extracting Figures, Tables and Captions from Computer Science Papers

Authors

Christopher Andreas Clark

Santosh Kumar Divvala

Abstract

Identifying and extracting figures and tables along with their captions from scholarly articles is important both as a way of providing tools for article summarization, and as part of larger systems that seek to gain deeper, semantic understanding of these articles. While many “off-the-shelf” tools exist that can extract embedded images from these documents, e.g. PDFBox, Poppler, etc., these tools are unable to extract tables, captions, and figures composed of vector graphics. Our proposed approach analyzes the structure of individual pages of a document by detecting chunks of body text, and locates the areas wherein figures or tables could reside by reasoning about the empty regions within that text. This method can extract a wide variety of figures because it does not make strong assumptions about the format of the figures embedded in the document, as long as they can be differentiated from the main article’s text. Our algorithm also demonstrates a caption-to-figure matching component that is effective even in cases where individual captions are adjacent to multiple figures. Our contribution also includes methods for leveraging particular consistency and formatting assumptions to identify titles, body text and captions within each article. We introduce a new dataset of 150 computer science papers along with ground truth labels for the locations of the figures, tables and captions within them. Our algorithm achieves 96% precision at 92% recall when tested against this dataset, surpassing previous state of the art. We release our dataset, code, and evaluation scripts on our project website for enabling future
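The idea of locating figure and table candidates by reasoning about the empty regions between chunks of body text can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes a single-column page, axis-aligned bounding boxes with y increasing downward, and body-text boxes that have already been detected. The names `candidate_regions`, `Box`, and `min_height` are hypothetical and chosen only for illustration.

```python
from typing import List, Tuple

# (x0, y0, x1, y1) in page coordinates, with y increasing downward (an assumption).
Box = Tuple[float, float, float, float]


def candidate_regions(page: Box, text_boxes: List[Box],
                      min_height: float = 40.0) -> List[Box]:
    """Return full-width horizontal bands of the page that no body-text box covers.

    Each band is a place where a figure or table could reside. The threshold
    min_height is a hypothetical parameter that discards ordinary paragraph gaps.
    """
    px0, py0, px1, py1 = page
    # Keep only the vertical extent of each text box, clipped to the page.
    spans = sorted((max(y0, py0), min(y1, py1)) for _, y0, _, y1 in text_boxes)

    regions: List[Box] = []
    cursor = py0  # lowest point already covered by text (or the top of the page)
    for top, bottom in spans:
        if top - cursor >= min_height:
            regions.append((px0, cursor, px1, top))
        cursor = max(cursor, bottom)
    # Check the gap between the last text box and the bottom of the page.
    if py1 - cursor >= min_height:
        regions.append((px0, cursor, px1, py1))
    return regions


if __name__ == "__main__":
    page = (0, 0, 600, 800)
    body_text = [(50, 0, 550, 200), (50, 450, 550, 800)]
    print(candidate_regions(page, body_text))
```

A real system would also need to handle multi-column layouts and match each candidate region to a nearby caption, which this sketch does not attempt.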

Similar resources

Full Text and Figure Display Improves Bioscience Literature Search

When reading bioscience journal articles, many researchers focus attention on the figures and their captions. This observation led to the development of the BioText literature search engine, a freely available Web-based application that allows biologists to search over the contents of Open Access Journals, and see figures from the articles displayed directly in the search results. This article ...

Running Head : TARGET INFERENCE Algorithmic Inference of Visual Search Targets from Eye Movements

232 words Main text (including appendix): 2201 words Figure captions: 307 words Figures: 4 Tables: 0 References: 16

Extracting information from text and images for location proteomics

There is extensive interest in automating the collection, organization and summarization of biological data. Data in the form of figures and accompanying captions in literature present special challenges for such efforts. Based on our previously developed search engines to find fluorescence microscope images depicting protein subcellular patterns, we introduced text mining and Optical Character...

Enhanced Browsing System for Electronic Theses and Dissertations

Electronic Theses and Dissertations (ETDs) can be a valuable aid to learning and scholarship. However, current systems that provide access to ETDs only provide a full text and/or metadata based search and browse facility, thereby limiting ways in which users can interact with and make use of such collections. Long documents like ETDs can be viewed as containing various streams of information te...

The effects of captioning texts and caption ordering on L2 listening comprehension and vocabulary learning

This study investigated the effects of captioned texts on second/foreign (L2) listening comprehension and vocabulary gains using a computer multimedia program. Additionally, it explored the caption ordering effect (i.e. captions displayed during the first or second listening), and the interaction of captioning order with the L2 proficiency level of language learners in listening comprehension a...

Publication date: 2015
